My goals

  1. Use plotly instead of ggplot2 for figures.
  2. No more than 3 hours from getting the data to generating the final report.

Later: repeat in Python.

Load libraries

Read the data

Link to the file.

## Parsed with column specification:
## cols(
##   number = col_double(),
##   game = col_character(),
##   release_date = col_character(),
##   price = col_double(),
##   owners = col_character(),
##   developer = col_character(),
##   publisher = col_character(),
##   average_playtime = col_double(),
##   median_playtime = col_double(),
##   metascore = col_double()
## )

Clean up the variables

game

Review missing values:

Number of rows where game name is missing
total
3
Rows with missing game names
number game release_date price developer publisher average_playtime median_playtime metascore min_owners max_owners
808 NA 2016-11-01 NA NA Aterdux Entertainment 0 0 NA 20000 50000
138 NA 2017-02-17 8.99 Jeroen Wimmers Jeroen Wimmers 0 0 76 0 20000
3945 NA 2018-03-27 0.99 Paleno Games Paleno Games 0 0 NA 0 20000

Remove missing values for games

developer and publisher

Are there any missing both developer and publisher columns?

Number of rows missing developer and publisher names
total
47

Preview the data after the cleaning steps

Only the first 10 rows because the dataset is large, see below results of skimr::skim()

Generate variable summary

Skim summary statistics
n obs: 26685
n variables: 11

Variable type: character
variable missing complete n min max empty n_unique
developer 150 26535 26685 1 237 0 16751
game 0 26685 26685 1 212 0 26610
publisher 95 26590 26685 1 131 0 13952
Variable type: Date
variable missing complete n min max median n_unique
release_date 42 26643 26685 2004-01-11 2018-12-31 2017-06-09 2942
Variable type: numeric
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
average_playtime 9 26676 26685 9.06 117.94 0 0 0 0 5670 ▇▁▁▁▁▁▁▁
max_owners 0 26685 26685 211169.95 1944091.17 20000 20000 20000 50000 2e+08 ▇▁▁▁▁▁▁▁
median_playtime 12 26673 26685 5.16 68.92 0 0 0 0 3293 ▇▁▁▁▁▁▁▁
metascore 23836 2849 26685 71.89 11.02 20 66 73 80 98 ▁▁▁▂▅▇▆▁
min_owners 0 26685 26685 91277.87 950467 0 0 0 20000 1e+08 ▇▁▁▁▁▁▁▁
number 0 26685 26685 2904.34 2377.1 1 821 2356 4524 8846 ▇▅▃▃▂▂▁▁
price 3094 23591 26685 8.95 12.16 0.49 2.99 5.99 9.99 595.99 ▇▁▁▁▁▁▁▁

Some game names are found multiple times:

Spot check for one game on Steam’s website did show different release dates. Here are the screenshots for “18 Wheels of Steel: American Long Haul” from the browsing page (release date is Jan 29, 2008)

browse

browse

and from the game page (release date is Dec 3, 2007).

browse

browse

Price over time

## Warning: Ignoring 3126 observations

Seems like there is a trend in terms of the price but not huge. We do observe a few outliers.

Developer and publishers with the most games

Select only the developers that put out at least 15 games for a smaller plot.

Similar approach to the publishers, but change the game minimum to 50.

Hey! Big Fish Games!

How many developers are also publishers and how many games do they have?

Most owned games

Ownership will be considered as the max number of owners.

Dota2 is the clear winner. My husband commented that it is also a free to play game which will definitely contribute the the overall number of people owning it.

Are the most owned games also have the highest scores?

## Warning: Ignoring 30 observations

Not necessarily! Again, tapping into my husband’s knowledge of video games, he explained that some of the games weren’t available through Steam right away.

Many hardcore players?

Compare average and median time played for each game in the last two weeks (do remove the missing values and 0)

The points below the diagonal line have average larger than median which means that a lot more people don’t spend too much time playing the game; the points above the diagonal line have median larger than the average which means that those games have a few very hard core players (they spend a lot of time playing the game).

Conclusions

I have certainly not explored all information in the data. Of course, it would be worthwhile looking at the date of release and the number of minutes playes. Are older games still holding up?

From conversation with my husband it seems that it is valuable to have some domain knowledge about this area and therefore additional variables. For example, information about where the game was realeased first (console?), whether the game is free to play and how many players play a certain game.

For this first time I definitely spent less time had I used ggplot2 library instead of plotly, however, this wasn’t the goal.

Problems with plotly

I have observed a strange behavior with plotly when I was trying to plot the top 100 games owned by most people and order the x axis by the maximum number of owners:

Why in this case the x axis labels are cut off and the first 2 values in the x axis are 200,000,000? I also seem to replicate the bahavior when I use categoryarray = ~min_owners, but then I have 100,000,000. I think this is a reproducible example that can be posted as a question on stack overflow.